Method and apparatus for automatic detection of data types for data type dependent processing

ABSTRACT

A method for automatic detection of data types for data type dependent processing has two orthogonal classification systems defined, and determines for incoming data items a data type according to the first classification system and another data type according to the second classification system. The first classification system comprises the data types Essence (E), Metadata (M) and Container (C). The second classification system comprises the data types Physical Data (PD) and Abstract Data (AD). A default data type may be defined for data items not being uniquely classifiable. Advantageously, the inventive method can be used when different classes of data items require different methods for processing, e.g. content searching.

The invention relates to a method and an apparatus for the classification, organization and structuring of different types of data, which can be used e.g. for data sorting, data storage or data retrieval.

BACKGROUND

The capacity of digital storage media such as hard disks or rewritable optical disks for personal recording of video and other data grows continuously. This results in new concepts such as the so-called home server, which is a central storage device with large capacity for recording any kind of data within the home. Such applications also require new ways to organize the recorded data, search for content and access specific recordings.

For this purpose data about data, often referred to as metadata, can be used. Various industry groups and standards bodies have been developing metadata standards for different purposes and applications. In multimedia applications, metadata typically are data about audiovisual (AV) data, these AV data often being called ‘essence’. However, a Database Management System (DBMS) that shall be able to handle data of various data types correctly requires a definition of data types, and a method to distinguish between them.

INVENTION

The invention is based on the recognition of the facts described in the following:

In devices providing a DBMS for handling of incoming data, including incoming metadata, it is necessary to classify said incoming data, and especially incoming metadata, since different processing is necessary for different kinds of metadata. For example, a text query is not suitable for metadata containing a picture in the well-known Graphics Interchange Format (GIF).

The problem to be solved by the invention is to classify the data automatically, such that a DBMS can utilize the result of the classification for correct data handling. This problem is solved by the method disclosed in claim 1 and by the apparatus disclosed in claim 5. The output of such apparatus may be directed towards e.g. a DBMS.

According to the invention, Metadata can be defined as data sets consisting of two parts, namely a first part being a link, the link pointing to a reference data set, and a second part being any data referring to said link. In the following, said first part is referred to as MD_LINK, and said second part is referred to as MD_LOAD. Any data item that does not contain at least one MD_LINK and a related MD_LOAD is defined to be Essence. Metadata often occur together with other Metadata or Essence, combined in a logical entity such as a file on a hard disk. Such a mixture of different kinds of Essence and Metadata is in the following called ‘Container’. Popular examples of such Containers are Hypertext Markup Language (HTML) files, or Portable Document Format (PDF) files.
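Purely by way of illustration, the above definition of Metadata and Essence may be sketched in Python as follows. The names `DataItem`, `md_load`, `md_link` and `first_system_type` are assumptions made for this sketch and are not part of the disclosure; Containers are left out here for brevity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DataItem:
    """Hypothetical record for a single data item."""
    md_load: Optional[str] = None   # MD_LOAD: data referring to the link
    md_link: Optional[str] = None   # MD_LINK: link pointing to a reference data set


def first_system_type(item: DataItem) -> str:
    """Metadata needs both an MD_LINK and a related MD_LOAD; all else is Essence."""
    if item.md_link and item.md_load:
        return "Metadata"
    return "Essence"


print(first_system_type(DataItem(md_load="W3C HOME", md_link="http://www.w3c.org")))
print(first_system_type(DataItem(md_load="This is a paragraph")))
print(first_system_type(DataItem(md_link="image.gif")))  # a link alone is not Metadata
```

In this sketch a data item consisting of a link only, without related MD_LOAD, falls back to Essence, in line with the definition given above.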

Further, according to the invention there is another type of classification possible. Data may require interpretation by the device before they can be used. In this case the data are defined to be Physical Data if the device has a method for interpretation defined, otherwise Abstract Data. If e.g. a picture is stored in GIF format, and the device can interpret GIF format and display it as a picture, it is classified as Physical Data. If the device cannot interpret GIF format, the data is classified as Abstract Data. Further examples of Abstract Data are text files, and other files that cannot be interpreted by the device.
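The system dependence of this second classification may be illustrated by the following minimal sketch. It assumes that each device is described simply by the set of formats it can interpret; the function name `second_system_type` and the two example devices are hypothetical.

```python
from typing import Set


def second_system_type(data_format: str, interpretable_formats: Set[str]) -> str:
    """Physical Data if the device has an interpretation method for the format."""
    if data_format.lower() in interpretable_formats:
        return "Physical Data"
    return "Abstract Data"


picture_viewer = {"gif", "jpeg"}   # hypothetical device that can display pictures
plain_recorder = set()             # hypothetical device without any interpreter

print(second_system_type("GIF", picture_viewer))   # Physical Data
print(second_system_type("GIF", plain_recorder))   # Abstract Data
```

The same GIF data thus receives a different data type on each device, which is the intended behaviour of a locally relevant classification.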

The previously defined two types of classification are not exclusive, but complement each other. Further, the described classification of data is not absolute, but system dependent, and therefore only locally relevant.

Advantageously, this classification allows the device to handle different data types correctly, distinguish between Metadata, Essence, Container, Physical Data and Abstract Data, and thus permit a generalized access method upon said data types. With this knowledge, the device can decide e.g. which type of data query to use, how to interpret data, and whether some data can be disregarded for a certain query.

Advantageous additional embodiments of the invention are disclosed in the following text, and in the respective dependent claims.

DRAWINGS

Exemplary embodiments of the invention are described with reference to the accompanying drawings, which show in:

FIG. 1 the two systems, or dimensions, of data classification;

FIG. 2 an example for a Container containing Essence and Metadata;

FIG. 3 an example for Abstract Metadata;

FIG. 4 an example for Physical Metadata; and

FIG. 5 an exemplary flow-chart for the method according to the invention.

EXEMPLARY EMBODIMENTS

According to the invention, the two types, or systems, of classification can be understood as two dimensions, as shown in FIG. 1. A data item may either be Essence E or Metadata M, and either Physical Data PD or Abstract Data AD. Therefore the possible data types are Physical Essence PE, Physical Metadata PM, Abstract Essence AE or Abstract Metadata AM. Further, a data item may also be a Container C, if it contains other data items.
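As an illustration, the two dimensions may be modelled as two independent enumerations whose combination yields the four data types named above. The class names `Kind` and `Interpretation` and the function `combined_label` are assumptions of this sketch; the Container type C is deliberately left out, since the text names no combined labels for it.

```python
from enum import Enum


class Kind(Enum):
    """First classification system (Container omitted in this sketch)."""
    ESSENCE = "E"
    METADATA = "M"


class Interpretation(Enum):
    """Second classification system."""
    PHYSICAL = "P"
    ABSTRACT = "A"


def combined_label(kind: Kind, interpretation: Interpretation) -> str:
    """Combine both dimensions into a two-letter data type, e.g. 'PM'."""
    return interpretation.value + kind.value


labels = [combined_label(k, i) for i in Interpretation for k in Kind]
print(labels)  # ['PE', 'PM', 'AE', 'AM']
```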

The classification of data is not absolute, but subjective from the point of view of the device, and therefore only relevant within a system, e.g. a DBMS. It may happen that e.g. one system can interpret a link while another system cannot interpret the same link. Therefore it may happen that e.g. one system classifies certain data as Metadata, consisting of MD_LOAD and MD_LINK, while another system classifies the same data as Essence because it cannot interpret the link. Another example is that e.g. one system can reproduce an MPEG audio layer 3, or MP3, coded file, while another system cannot interpret the MP3 format. In this case the first system classifies an MP3 coded file as Physical Data, but the second system classifies the same file as Abstract Data.

Text is to be regarded as Abstract Data, because text is always a format for saving data. Formatted text can represent a direct physical representation of data, e.g. the PDF format. The format information represents only support information, i.e. if the format information is removed from a PDF file, the pure text, being the main information, will remain. If the text is removed, the main information will be lost. Because the text represents the main information, formatted text will also be regarded as Abstract Data.

A device as disclosed in claim 5 will execute the following procedure when receiving data on its input:

-   If the data contain more than one data item, the output may be: “Data is a Container”. More details are given below. Classification may stop here, or may be extended to some, or all, leaves of the hierarchically structured data tree within the Container.
-   If data are Metadata, the output may be: “Data are Metadata”.
-   Otherwise the output may be “Data are Essence”.
-   If data are Physical Data, an additional output may be “Data are Physical Data”.
-   Otherwise, if data are Abstract Data, an additional output may be “Data are Abstract Data”.
-   Advantageously the device can detect and output the type of Physical Data, e.g. “Data is a color picture (24 bit) with the resolution x=200 pixels and y=400 pixels”.
-   If the data format is unknown to the device, and therefore the device is not able to classify the data as Container, Metadata, Essence, Abstract Data or Physical Data, the output may be any default-type output, e.g. “Data type is unknown” or “Data are Essence and Abstract Data”.

Additionally, it is helpful if the device detects whether data is text or not:

-   If data is Abstract Data and text, the output may be additionally “Data is Text”.

This may be implemented by searching for known words, e.g. from an electronic dictionary, or searching for groups of characters separated by blanks.
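One conceivable way of combining both heuristics is sketched below. The small word set `KNOWN_WORDS` merely stands in for an electronic dictionary, and the threshold `min_ratio`, the UTF-8 decoding step and the function name `looks_like_text` are assumptions of this sketch rather than features taken from the description.

```python
import re

# Stand-in for an electronic dictionary (hypothetical, far too small in practice).
KNOWN_WORDS = {"this", "is", "a", "the", "title", "paragraph"}


def looks_like_text(data: bytes, min_ratio: float = 0.5) -> bool:
    """Guess whether data is text by looking up blank-separated character groups."""
    try:
        decoded = data.decode("utf-8")
    except UnicodeDecodeError:
        return False                      # not decodable, so treated as non-text
    groups = decoded.split()              # groups of characters separated by blanks
    if not groups:
        return False
    words = [re.sub(r"\W", "", group).lower() for group in groups]
    known = sum(1 for word in words if word in KNOWN_WORDS)
    return known / len(groups) >= min_ratio


print(looks_like_text(b"This is a paragraph"))       # True
print(looks_like_text(b"GIF89a\x01\x00\xff\xfe"))    # False
```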

If the input data is a Container, an additional output may be “Data is a Container, i.e. more metadata or essence are contained”. Optionally, precise details can be included: “The Container CONTAINS at least 1 Metadata and 1 Essence”, or “The Container CONTAINS no Metadata at all” or even “The Container CONTAINS exactly N Metadata items”, with N being the number of Metadata items contained in the Container.

If the device can detect the format of the analyzed data, it may output it additionally: “Data format is X”. ‘X’ is the format. Examples for ‘X’ can be e.g. ‘HTML’ or ‘JPEG’.

FIG. 2 shows an example for a data file containing a combination of Essence and Metadata in the well-known HTML format. In the following, the classification of all elements according to the invention is described.

First the device detects that the first line is <html>, and that therefore the data file should be HTML formatted. It is assumed that the device can interpret the HTML format, and therefore interprets items with “href” attributes in HTML files as links. Since HTML formatted files usually contain a hierarchical structure, the leaf elements of the hierarchy tree are analyzed first. The first element from FIG. 2

-   <title>This is the title</title>

is classified as Essence because there is no link attached to the element.

The element

-   <a href=http://www.w3c.org>W3C HOME</a>

is classified as Metadata, with the string “W3C HOME” being the Essence, or MD_LOAD, and the string “href=http://www.w3c.org” being the related link, or MD_LINK.

The next leaf element

-   <p>This is a paragraph</p>

contains no link and is therefore classified as Essence.

The next leaf element

-   <img src=“image.gif”>

is also classified as Essence because it is only a link, i.e. it contains no MD_LINK with related MD_LOAD. Therefore it cannot be Metadata. The purpose of this link is to reference further Essence, namely the picture data.

When all elements of the first level of hierarchy are analyzed, the next level is investigated. The element

-   <head>
        <title>This is the title</title>
    </head>

is classified as Essence because it contains no link, but only one element, the element being Essence.

The element <a href=http://www.w3c.org> <img src=“image.gif”> </a> is classified as Metadata, with <img src=“image.gif”> being the MD_LOAD part and the “href” attribute being the related link.

The next element <body> ... </body> is classified as Container because it groups together Metadata items and Essence items.

Finally, the element <html> ... </html> is also classified as Container. It groups together an Essence element, namely the <head> element, and a Container, namely the <body> element.
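The bottom-up classification walked through above may be sketched as follows. FIG. 2 itself is not reproduced in the text, so the document `DOC` is a reconstruction from the elements discussed, written with quoted attributes and a self-closed <img> element so that Python's standard `xml.etree.ElementTree` parser accepts it. The function `classify` and the rule that a nested Container also makes its parent a Container are assumptions of this sketch.

```python
import xml.etree.ElementTree as ET

# Reconstructed stand-in for the file of FIG. 2 (an assumption, see lead-in).
DOC = """<html>
  <head><title>This is the title</title></head>
  <body>
    <a href="http://www.w3c.org">W3C HOME</a>
    <p>This is a paragraph</p>
    <a href="http://www.w3c.org"><img src="image.gif"/></a>
  </body>
</html>"""


def classify(element, results):
    """Classify children first (leaves before parents), then the element itself."""
    child_types = [classify(child, results) for child in element]
    has_load = bool((element.text or "").strip()) or len(element) > 0
    if "href" in element.attrib and has_load:
        data_type = "Metadata"        # MD_LINK with related MD_LOAD
    elif any(t in ("Metadata", "Container") for t in child_types):
        data_type = "Container"       # groups Metadata with other items
    else:
        data_type = "Essence"
    results.append((element.tag, data_type))
    return data_type


results = []
classify(ET.fromstring(DOC), results)
for tag, data_type in results:
    print(f"<{tag}>: {data_type}")
```

Under these assumptions the sketch yields Essence for <title>, <head>, <p> and <img>, Metadata for both <a> elements, and Container for <body> and <html>, matching the classification described above.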

FIG. 3 shows an example for Abstract Metadata. Several data items 3R, 3M are grouped in a data unit 3C. The data unit 3C could be e.g. an HTML file. For one of said data items the device has detected that it contains a link 3L, symbolized by the cursor switching from an arrow to a hand when pointing to the text 3E. Since the text 3E and the link 3L belong together, and the text 3E is Essence, they form a Metadata item 3M, and the link 3L is a Metadata link pointing to a reference 3REF outside the data unit 3C. Since the Essence 3E of the Metadata item 3M is text, and text is Abstract Data, the Metadata item 3M is an Abstract Metadata item. Remaining data items 3R within the data unit 3C are any text and a picture. The data unit 3C is a Container, since it contains at least one Metadata item 3M and other, remaining data items 3R.

FIG. 4 shows an example for Physical Metadata. Several data items 4R, 4M are contained in a data unit 4C, the unit 4C being e.g. an HTML file. In this case, the device has detected that the picture 4E is associated with a link 4L, symbolized by the cursor switching from an arrow to a hand. The link 4L is pointing to a reference 4REF outside the data unit 4C. Since the picture 4E and the link 4L belong together, they form a Metadata item 4M, with the picture 4E being the Essence of this Metadata. Said Essence 4E is e.g. a JPEG formatted picture, and in the HTML file it may be referenced e.g. as <img src=Anton.jpg width=108 height=73>. Since the device can display it, it is Physical Data, and the Metadata item 4M is Physical Metadata. The data unit 4C is a Container, because it contains at least one Metadata item 4M and other items 4R.

FIG. 5 shows an exemplary flow chart of the inventive method. The purpose of the invention is to classify different types of incoming data IN. The incoming data IN are analyzed, and a first decision block D1 decides whether the format of the incoming data can be detected. If not, ‘Unknown’ is indicated as an output, and the classification finishes at an end state EX. If the format is known, e.g. HTML, then a second decision block D2 may decide if the incoming data contains unclassified elements. If the answer is ‘Yes’, the next unclassified data item is picked and forwarded to a third decision block D3. This decision block D3 may decide whether said data item is a Container C, Metadata M or Essence E. The decision is ‘Container’ if the data item contains another data item already classified as Metadata. The decision is ‘Metadata’ if the data item contains a link with essence relating to that link. In all other cases the decision is ‘Essence’. The decision made in the third decision block D3 is indicated at the output.

If the analyzed data item is a Container C, then the procedure returns to the second decision block D2 again, otherwise a fourth decision block D4 is entered. Said fourth decision block D4 decides whether the device can interpret the data item, such that it may disclose further information to the user, e.g. a displayable picture. If the answer is ‘Yes’, it is indicated at the output that said data item is Physical Data PD, otherwise Abstract Data AD. In the case of said data item being Physical Data PD, format detection may have been done implicitly in said fourth decision block D4. Then a fifth decision block D5 may detect format details and decide whether the detected format shall be indicated, and if so, the format F1, . . . , F3 may be indicated at the output. In the case of said data item being Abstract Data AD, a sixth decision block D6 may decide if the data contains text. If so, this is indicated at the output. If the data item is Abstract Data AD and not text, no further indication is generated.

Then the procedure is repeated from the second decision block D2 that decides if further unclassified elements are contained. If this is not the case, then the data item has been classified completely and the end state EX is entered. This embodiment of the invention analyzes all hierarchy levels and leaf elements of Containers, but other embodiments may analyze only some hierarchy levels or leaf elements of Containers.
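The sequence of decision blocks D1 to D6 may be sketched as follows. The sketch assumes that format detection has already taken place and is recorded in a hypothetical `Item` record, that the device is described by a set of interpretable formats, and that the format is always indicated for Physical Data; the names `Item`, `classify` and `_classify_item` are not taken from the description. Contained items are handled by recursion instead of an explicit loop over D2.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Item:
    name: str
    fmt: Optional[str] = None       # detected format; None if not detectable
    link: Optional[str] = None      # MD_LINK related to this item's payload
    children: List["Item"] = field(default_factory=list)


def classify(data: Item, interpretable: Set[str]) -> List[str]:
    """Return the output indications for incoming data IN."""
    if data.fmt is None:                                   # D1: format unknown
        return [f"{data.name}: Unknown"]
    out: List[str] = []
    _classify_item(data, interpretable, out)
    return out                                             # end state EX


def _classify_item(item: Item, interpretable: Set[str], out: List[str]) -> str:
    # D2: contained, still unclassified items are classified first
    kinds = [_classify_item(child, interpretable, out) for child in item.children]
    # D3: Container, Metadata or Essence
    if any(kind in ("Container", "Metadata") for kind in kinds):
        kind = "Container"
    elif item.link is not None:
        kind = "Metadata"
    else:
        kind = "Essence"
    out.append(f"{item.name}: {kind}")
    if kind == "Container":
        return kind                                        # back to D2
    if item.fmt in interpretable:                          # D4: interpretable?
        out.append(f"{item.name}: Physical Data, format {item.fmt}")  # D5
    else:
        out.append(f"{item.name}: Abstract Data")
        if item.fmt == "text":                             # D6: text?
            out.append(f"{item.name}: Text")
    return kind


page = Item("page", fmt="html", children=[
    Item("caption", fmt="text"),
    Item("photo", fmt="jpeg", link="http://example.org/ref"),
])
for line in classify(page, interpretable={"jpeg"}):
    print(line)
```

Calling `classify` with an empty set of interpretable formats would indicate the same photo item as Abstract Data, reflecting that the result depends on the classifying device.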

Advantageously, the described method for data classification can be used in devices for data sorting, data storage, e.g. a DBMS, or data retrieval, e.g. browsers. The described method can be used when different classes of data require different processing, e.g. different search algorithms, different storage methods or areas, different compression methods or different presentation methods.

The invention can be implemented in a separate device, which will classify incoming data with respect to its format, content, and relation to other data, e.g. links, and which provides information about data. This information is especially necessary when it must be recognized whether these data contain links or whether these data need special query methods.

The device can be part of another device or can be realized as hardware or software, e.g. as an application or a plug-in in a PC. Further, it can be updated, e.g. via the Internet or via other sources, so that more and more formats can be recognized. The device can thus update itself and become more and more efficient.

1. Method for automatic detection of data types for data type dependent processing by a technical device, characterized in comprising the steps of: a) receiving data of different data types, b) analyzing said received data, c) detecting the format of the received data, d) using said detected format for evaluating whether said data contain at least one machine-interpretable link and associated data, any other data except data of said first type, or a mixture of said machine-interpretable link and associated data with said other data, e) evaluating whether said technical device is able to interpret said data for reproducing a physical representation of said data, and f) supplying the result of said first evaluation and the result of said second evaluation to a device or process for data type dependent processing of said data.

2. Method according to claim 1, wherein for data being interpretable by said technical device is also indicated whether the format type of said data is one of a number of specified format types.

3. Method according to claim 1, wherein for data being not interpretable by said technical device is also indicated if it is text.

4. Method according to claim 1, wherein said technical device is a data sorting device, a database management system or a data content browser.

5. Apparatus for automatic detection of data types for data type dependent processing according to claim 1.

6. Method for automatic detection of data types as to categorize data according to said data types, comprising the steps of: receiving data; determining if said received data is a container data type; determining said received data is at least one of a metadata data type and essence data type, when said received data is not determined to be of said container data type; and determining said received data is at least one of physical data type and abstract data type, after said step of determining whether said received data is metadata data type and essence data type.

7. Method according to claim 6, wherein said received data is determined to be said container data type when a portion of data selected from said received data has been previously determined being said metadata data type.

8. Method according to claim 6, wherein said container data type comports to an HTML compatible data format.

9. Method according to claim 6, wherein said received data is determined to be said metadata data type when said data comprises a link with an essence related to said link.

10. Method according to claim 9, wherein said received data is determined to be said essence data type instead of said metadata data type.

11. Method according to claim 6, wherein said received data is determined to be said physical data type when said received data is capable of being interpreted by a device implementing said method.

12. Method according to claim 11, wherein said received data is determined to be abstract data type instead of said physical data type.

13. Method according to claim 11, wherein said interpretation by said device is displaying said received data as a picture.